Contents

  • Reviews
  • Introduction
  • Inference as Optimization
  • Expectation Maximization
  • MAP Inference and Sparse Coding
  • Variational Inference and Learning
    • Discrete Latent Variables
    • Calculus of Variations
    • Continuous Latent Variables
    • Interactions between Learning and Inference
  • Learned Approximate Inference
    • Wake-Sleep
    • Other Forms of Learned Inference

Reviews

Restricted Boltzmann Machine (Ch 16)

  • Energy-based Model
    • $\tilde{p}(\mathbf{x})=\exp(-E(\mathbf{x}))$

Discrete Case of RBM

  • All $\mathbf{h}_i$, $\mathbf{v}_i$ are 0 or 1

  • We can extract closed form $p(\mathbf{h}|\mathbf{v})$, $p(\mathbf{v}|\mathbf{h})$
  • But how do we get $p(\mathbf{v})$?

Intractability of computing Partition Functions, Again

  • Computing $\tilde{p}(\mathbf{v})$ is easy
  • But how do we compute $p(\mathbf{v}) = \cfrac{1}{Z} \tilde{p}(\mathbf{v})$?
  • We need to compute the partition function $Z$

  • Given: visible data

    • $\mathbf{x}^{(i)}$
  • "Visible" and hidden variables generated by Gibbs sampling
    • $\tilde{\mathbf{x}}^{(i)}$, $\mathbf{h}^{(j)}$
  • Parameters: $W$, $\mathbf{b}$, $\mathbf{c}$
    • We need the gradients with respect to these (a sketch follows below)
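
A minimal sketch of this review, assuming a toy binary RBM (sizes, values, and helper names such as `cd1_gradients` are my own): the conditionals $p(\mathbf{h}|\mathbf{v})$ and $p(\mathbf{v}|\mathbf{h})$ are closed-form, while the gradient term coming from $Z$ is approximated with one step of Gibbs sampling (CD-1).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy binary RBM with energy E(v, h) = -b.v - c.h - v^T W h (sizes are arbitrary).
nv, nh = 6, 3
W = 0.01 * rng.standard_normal((nv, nh))
b = np.zeros(nv)   # visible biases
c = np.zeros(nh)   # hidden biases

def p_h_given_v(v):
    """Closed-form conditional p(h_j = 1 | v)."""
    return sigmoid(c + v @ W)

def p_v_given_h(h):
    """Closed-form conditional p(v_i = 1 | h)."""
    return sigmoid(b + W @ h)

def cd1_gradients(v0):
    """CD-1 estimate of the log-likelihood gradients for W, b, c:
    positive phase from the data, negative phase from one Gibbs step
    (a cheap stand-in for the intractable expectation that involves Z)."""
    ph0 = p_h_given_v(v0)
    h0 = (rng.random(nh) < ph0).astype(float)   # sample h ~ p(h | v0)
    pv1 = p_v_given_h(h0)
    v1 = (rng.random(nv) < pv1).astype(float)   # sample v~ ~ p(v | h0)
    ph1 = p_h_given_v(v1)
    dW = np.outer(v0, ph0) - np.outer(v1, ph1)
    db = v0 - v1
    dc = ph0 - ph1
    return dW, db, dc

v_data = (rng.random(nv) < 0.5).astype(float)   # a stand-in observed x^(i)
dW, db, dc = cd1_gradients(v_data)
```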

Introduction

The challenge of inference usually refers to the difficult problem of computing $p(h|v)$ or taking expectations with respect to it.

  • $p(v)$, $p(h|v)$, and $p(v|h)$ are the key quantities for the explanations that follow

19.1 Inference as Optimization

  • What we want to know: $\log p(v;\theta)$
    • Computing $\log p(v;\theta)$ exactly is intractable
    • Instead we work with a lower bound $\mathcal{L}$ (the evidence lower bound)
    • Consider hidden variables $h$
    • Introduce a new distribution $q(h|v)$ over them
    • $\mathcal{L}(v,\theta,q) = \log p(v;\theta) - D_{KL}(q(h|v)\|p(h|v;\theta))$

  • The tighter the bound $\mathcal{L}$ is
    • the better $q(h|v)$ approximates $p(h|v)$
    • $q(h|v) = p(h|v)$ => $\mathcal{L} = \log p(v;\theta)$
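
A tiny numeric check of this identity on a made-up joint $p(v,h)$ over two binary variables (all numbers are arbitrary toy values): for any $q(h|v)$, the bound $\mathcal{L} = \mathbb{E}_{h\sim q}[\log p(v,h)] + H(q)$ satisfies $\log p(v) = \mathcal{L} + D_{KL}(q\|p(h|v))$, so it is tight exactly when $q$ equals the true posterior.

```python
import numpy as np

# Toy joint p(v, h) over two binary variables (rows: v = 0, 1; cols: h = 0, 1).
p_joint = np.array([[0.30, 0.10],
                    [0.15, 0.45]])
v = 1                                  # the observed value
p_v = p_joint[v].sum()                 # exact marginal p(v)
p_h_given_v = p_joint[v] / p_v         # exact posterior p(h | v)

q = np.array([0.7, 0.3])               # any approximate posterior q(h | v)

# Evidence lower bound: L = E_{h~q}[log p(v, h)] + H(q)
elbo = np.sum(q * np.log(p_joint[v])) - np.sum(q * np.log(q))
kl = np.sum(q * np.log(q / p_h_given_v))

print(np.log(p_v), elbo + kl)          # identical: log p(v) = L + KL(q || p)
assert np.isclose(np.log(p_v), elbo + kl)
```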

19.2 Expectation Maximization

  • A popular training algorithm for models with latent variables
    • e.g., k-means clustering
  • EM is not itself an approach to approximate inference
  • Rather, it is an approach to learning with an approximate posterior
  • Stochastic gradient ascent on latent variable models can be seen as a special case of the EM algorithm
    • where the M step consists of taking a single gradient step.
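
As a concrete toy illustration (the data and model choices below are my own), a minimal EM sketch for a two-component Gaussian mixture with unit variances and equal weights: the E step computes the exact posterior $q(h|x)$ over the latent component assignment, and the M step maximizes $\mathbb{E}_{h\sim q}\log p(x,h)$ over the means in closed form (replacing the M step with a single gradient step would recover the stochastic-gradient view mentioned above).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two unit-variance Gaussians with equal weights.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

mu = np.array([-1.0, 1.0])   # initial guesses for the two component means

for _ in range(50):
    # E step: exact posterior over the latent component (tractable here).
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2        # log N(x; mu_k, 1) + const
    resp = np.exp(logp - logp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: maximize E_{h~q}[log p(x, h)] over the means in closed form.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # roughly recovers (-2, 3)
```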

19.3 MAP Inference and Sparse Coding

19.4 Variational Inference and Learning

  • Why "variational"?

    • The optimization is over a function, not just a vector of parameters
    • That function is $q$ in $\mathcal{L}(\mathbf{v}, \theta, q)$
  • The core idea

    • Maximize $\mathcal{L}$ over a restricted family of distributions $q$
      • Do not just control $\theta$
      • Also control $q$, chosen from a restricted family
    • The restriction on $q$ is what makes the optimization tractable
  • Mean-field approach

    • Restrict $q$ to factorize: $q(\mathbf{h}|\mathbf{v}) = \prod_{i}q(\mathbf{h}_i|\mathbf{v})$
    • and maximize $\mathcal{L} = \log p(\mathbf{v};\theta)-D_{KL}(q(\mathbf{h}|\mathbf{v})\|p(\mathbf{h}|\mathbf{v};\theta))$

19.4.1 Discrete Latent Variables

Binary sparse coding model example

  • $\mathbf{h}_i$ is binary.
  • set $\hat{h}_i = q(\mathbf{h}_i=1|\mathbf{v})$
    • $1 - \hat{h}_i = q(\mathbf{h}_i=0|\mathbf{v})$
  • $p(h_i = 1) = \sigma(b_i)$
  • $p(v|h) = \mathcal{N}(v;Wh, \beta^{-1})$

  • Target

    • The marginal likelihood $p(v)$
    • $h$ enters through the prior
      • each $h_i$ is 1 with probability $\sigma(b_i)$
      • so $b_i$ is a key parameter

  • Find $b_i$ to maximize $p(v)$
    • This requires $p(h|v)$
  • $p(h|v) \approx q(h|v) = \prod_i q(h_i|v)$
    • Replace $p(h|v)$ with $\prod_i q(h_i|v)$
    • set $\hat{h}_i = q(\mathbf{h}_i=1|\mathbf{v})$
    • so $1 - \hat{h}_i = q(\mathbf{h}_i=0|\mathbf{v})$

  • Fixed-point update equations
    • Set $\cfrac{\partial}{\partial \hat{h}_i} \mathcal{L}(v, \theta, \hat{h}) = 0$
    • This yields updates of the form $\hat{h}_i^{(t)} = f\big(\hat{h}_j^{(t-1)}\big)$, $\hat{h}_j^{(t)} = f\big(\hat{h}_i^{(t-1)}\big)$
    • Iterating these updates converges to a (local) optimum of $\mathcal{L}$
  • The iteration behaves like a recurrent network (RNN)
    • Each $\hat{h}_i$ is recomputed from the other $\hat{h}_j$ (see the sketch below)
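
A sketch of the resulting fixed-point iteration for the binary sparse coding model above, with toy sizes and a scalar precision $\beta$ for simplicity. The update below is the mean-field form obtained from $\partial\mathcal{L}/\partial\hat{h}_i = 0$ for this model; since the derivation is not reproduced in these notes, treat the exact expression as an assumption rather than something established here. Each $\hat{h}_i$ is recomputed from the current values of the other $\hat{h}_j$, which is the recurrent-network structure noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy binary sparse coding model: p(h_i = 1) = sigmoid(b_i),
# p(v | h) = N(v; W h, beta^{-1} I). Sizes and values are arbitrary.
nv, nh = 5, 4
W = rng.standard_normal((nv, nh))
b = rng.standard_normal(nh)
beta = 1.0                       # scalar precision for simplicity
v = rng.standard_normal(nv)

h_hat = np.full(nh, 0.5)         # initial q(h_i = 1 | v)

for _ in range(100):             # sweep the fixed-point equations until convergence
    for i in range(nh):
        # contribution of the other units j != i, using their current h_hat_j
        cross = beta * W[:, i] @ (W @ h_hat - W[:, i] * h_hat[i])
        h_hat[i] = sigmoid(b[i]
                           + beta * v @ W[:, i]
                           - 0.5 * beta * W[:, i] @ W[:, i]
                           - cross)

print(h_hat)                     # converged mean-field marginals q(h_i = 1 | v)
```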

19.4.2 Calculus of Variations

19.4.3 Continuous Latent Variables

  • Calculus of variations lets us find the $\tilde{q}$ that maximizes $\mathcal{L}(v,\theta, q)$
  • The derivation below follows Kevin Murphy's book

  • Assumptions for simplicity
    • $h \in \mathbb{R}^2$, so $i = 1,2$
    • $p(h) = \mathcal{N}(h;0, \mathbf{I})$
    • $p(v|h) = \mathcal{N}(v;w^\top h,1)$
  • Work with the compatibility function $\tilde{p}$ (the unnormalized probability)

  • Reduce Notations
    • $\langle h_2 \rangle=\mathbb{E}_{h \sim q(h|v)}[h_2]$
    • $\langle h^2_2 \rangle=\mathbb{E}_{h \sim q(h|v)}[h^2_2]$
  • Since $\tilde q$ turns out to be Gaussian, $q$ is also Gaussian
    • so set $q=\mathcal{N}(h;\mu, \beta^{-1})$
    • Find $\mu, \beta$ with a conventional optimization method (see the sketch below)
    • $\mu, \beta$ are the parameters of the variational approximation
    • $w$ is a parameter of the learning (generative) model
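
A minimal sketch for this example (the values of $w$ and $v$ are arbitrary): the exact posterior $p(h|v)$ is Gaussian with precision $\Lambda = I + ww^\top$ and mean $\Lambda^{-1}wv$, and coordinate ascent on the mean-field factors $q(h_i|v) = \mathcal{N}(h_i;\mu_i,\beta_i^{-1})$ fixes $\beta_i = \Lambda_{ii}$ while updating each $\mu_i$ from the other mean.

```python
import numpy as np

# Toy continuous-latent model from above: h in R^2,
# p(h) = N(h; 0, I), p(v | h) = N(v; w^T h, 1). Values are arbitrary.
w = np.array([1.0, -0.5])
v = 2.0

# Exact posterior p(h | v): Gaussian with precision Lam and mean Lam^{-1} w v.
Lam = np.eye(2) + np.outer(w, w)
mu_exact = np.linalg.solve(Lam, w * v)

# Mean-field q(h | v) = q(h_1 | v) q(h_2 | v): each factor is Gaussian
# N(h_i; mu_i, 1 / Lam_ii), i.e. beta_i = Lam_ii. Coordinate ascent on the means:
mu = np.zeros(2)
for _ in range(50):
    for i in range(2):
        j = 1 - i
        mu[i] = (w[i] * v - Lam[i, j] * mu[j]) / Lam[i, i]

# The mean-field means match the exact posterior mean; the per-factor
# variances 1 / Lam_ii underestimate the exact marginal variances.
print(mu, mu_exact)
```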

Overall Process

main-loop for gradient update

  1. approximate inference-loop

    update $q_i$s

  2. MCMC loop for the partition function

    sampling

  3. update gradient
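
A structural sketch of this loop only; every helper below is a hypothetical placeholder for model-specific code (not a real API), and the point is just the order of the three steps.

```python
def approximate_inference(theta, v):        # inner loop 1: update the q_i's
    ...                                     # e.g. the mean-field fixed-point sweep of 19.4.1

def mcmc_samples(theta):                    # inner loop 2: samples for the log Z gradient
    ...                                     # e.g. Gibbs sampling as in the RBM review

def gradient(theta, v, q, samples):         # positive phase (uses q) minus negative phase
    ...

def training_step(theta, v, learning_rate=0.01):
    q = approximate_inference(theta, v)     # 1. approximate inference loop
    samples = mcmc_samples(theta)           # 2. MCMC loop for the partition function
    return theta + learning_rate * gradient(theta, v, q, samples)   # 3. gradient update
```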

19.4.4 Interactions between Learning and Inference

  • Approximate inference <-> the learning process: the two interact
  • The final goal is to maximize $\log p(v;\theta)$
  • The intermediate goal is to maximize $\mathbb{E}_{h\sim q} \log p(v,h)$
  • A modality mismatch (e.g., a unimodal $q$ fit to a multimodal $p(h|v)$) can produce a poor approximation
  • To check approximation quality, compute the gap between $\log p(v;\theta)$ and $\mathcal{L}(v, \theta, q)$ (the gap equals $D_{KL}(q(h|v)\|p(h|v))$)

19.5 Learned Approximate Inference

  • Optimization via iterative procedures such as fixed-point equations is often very expensive and time-consuming.
  • Instead, learn an inference network $\hat{f}(v;\theta) \approx q$ that predicts the variational parameters in a single pass
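
A minimal amortized-inference sketch in the binary-latent setting of 19.4.1 (the network, its name, and all sizes are illustrative assumptions): a single layer maps $v$ to the variational parameters $\hat{h}$ in one forward pass, replacing the per-example fixed-point loop; its weights would be trained alongside the model, e.g. by gradient ascent on $\mathcal{L}$.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-layer inference network f_hat(v): a single matrix multiply
# stands in for the iterative mean-field fixed-point loop of 19.4.1.
def inference_net(v, W_e, c_e):
    return sigmoid(v @ W_e + c_e)          # predicted h_hat_i = q(h_i = 1 | v)

rng = np.random.default_rng(0)
nv, nh = 5, 4                              # toy sizes
W_e = 0.1 * rng.standard_normal((nv, nh))
c_e = np.zeros(nh)
v = rng.standard_normal(nv)
print(inference_net(v, W_e, c_e))          # approximate posterior in one pass
```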

19.5.1 Wake-Sleep

  • Wake phase -> update $\theta$ (the generative model)
  • Sleep phase -> update $\hat{f}$ (the inference network)

    main-loop (a toy sketch follows after the comparison below)

    1. wake phase

      sample $h$ from $\hat{f}(v)$ on real data $v$; update $\theta$

    2. sleep phase

      sample $(v, h)$ from the generative model; update $\hat{f}$

  • c.f. mean-field approximation loop

    main-loop for gradient update

    1. approximate inference-loop

      update $q_i$s

    2. MCMC loop for the partition function

      sampling

    3. update gradient
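
A toy, runnable wake-sleep sketch under assumed model choices (the specific generative and recognition models here are simple picks of mine, not the algorithm's canonical setup): generative model $p(h_j{=}1)=\sigma(b_j)$, $p(v|h)=\mathcal{N}(Wh,\mathbf{I})$; recognition network $q(h_j{=}1|v)=\sigma((v^\top R)_j + r_j)$. The wake phase infers $h$ with the recognition network on data and updates $(W, b)$; the sleep phase dreams $(h, v)$ from the generative model and updates $(R, r)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

nv, nh, lr = 6, 3, 0.01
# Generative model theta = (W, b): p(h_j = 1) = sigmoid(b_j), p(v | h) = N(W h, I).
W = 0.1 * rng.standard_normal((nv, nh))
b = np.zeros(nh)
# Recognition network (R, r): q(h_j = 1 | v) = sigmoid(v @ R + r).
R = 0.1 * rng.standard_normal((nv, nh))
r = np.zeros(nh)

data = rng.standard_normal((100, nv))        # stand-in "observed" visible vectors

for v in data:
    # Wake phase: infer h with the recognition net, then raise log p(v, h) wrt theta.
    h = (rng.random(nh) < sigmoid(v @ R + r)).astype(float)
    W += lr * np.outer(v - W @ h, h)         # gradient of log N(v; W h, I) wrt W
    b += lr * (h - sigmoid(b))               # gradient of log p(h) wrt b

    # Sleep phase: dream (h, v) from the generative model, then raise log q(h | v).
    h_s = (rng.random(nh) < sigmoid(b)).astype(float)
    v_s = W @ h_s + rng.standard_normal(nv)
    err = h_s - sigmoid(v_s @ R + r)         # gradient of the Bernoulli log-likelihood wrt logits
    R += lr * np.outer(v_s, err)
    r += lr * err
```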

19.5.2 Other Forms of Learned Inference

  • Learned approximate inference has recently become one of the dominant approaches to generative modeling
    • most prominently in the form of the variational autoencoder (VAE)
